Anomalo unveils quality monitoring for unstructured data

The data quality specialist's capabilities will enable customers to monitor unstructured text to ensure the health of data used to inform analytics and AI models and applications.

Anomalo unveiled new AI-powered data quality monitoring capabilities for unstructured text, which for the first time will enable the vendor's customers to gain insights into their unstructured data in addition to structured data.

The vendor introduced the new capabilities -- which are currently in private beta testing, with general availability expected before the end of the year -- on June 12 during Data + AI Summit, a user conference hosted by Anomalo partner Databricks in San Francisco.

Anomalo does not publicize its pricing, but once the features are generally available, customers will be charged on a consumption basis for using the vendor's data observability capabilities for unstructured text, according to Jeremy Stanley, Anomalo's co-founder and CTO.

Business intelligence historically focused on structured data such as financial records and transactions. Unstructured data such as text and images, meanwhile, was difficult to access and use for analysis, given its lack of a searchable structure, and so was often left unused.

Now, however, advancing technologies such as vector embeddings can assign structure to unstructured data, making it easier to discover and use to inform decisions. As a result, many enterprises are tapping into their unstructured data -- which is estimated to make up more than three-quarters of all data -- to provide a more comprehensive picture of their operations and make better-informed decisions.

Given that rise in use of unstructured data, Anomalo's new data quality monitoring capabilities for unstructured text are important, according to Stephen Catanzano, an analyst at TechTarget's Enterprise Strategy Group.

"Anomalo's new AI-powered monitoring of unstructured text is significant as it enables enterprises to identify and resolve quality issues in unstructured data efficiently," Catanzano said.

Based in Palo Alto, Calif., Anomalo is a 2018 startup that specializes in data observability. The vendor's tools can be used in concert with most major data management platforms. In addition to Databricks, Anomalo provides integrations with data management offerings from AWS, Google, Microsoft, Oracle and Snowflake.

Anomalo competitors, meanwhile, include fellow data observability specialists such as Datadog, Monte Carlo and DQLabs.

New capabilities

While unstructured data use is increasing because it enables enterprises to gain deeper insights into their operations than structured data alone, another reason it is taking on greater importance is its potential to inform generative AI models and applications, according to Catanzano.

Interest in generative AI exploded following OpenAI's launch in November 2022 of ChatGPT, which marked a significant improvement in large language model (LLM) capabilities.

Analytics and data management platforms are often difficult to use, requiring coding knowledge and data literacy skills. Natural language processing has the potential to reduce the complexity of such platforms by lowering coding requirements and assisting with data interpretation and analysis.

Generative AI models, when combined with an organization's proprietary data, enable true natural language interactions. As a result, many vendors have made generative AI a priority of their product development, while many enterprises have built models and applications combining LLM technology with their own data to make data management and analytics easier and more efficient.

However, for those models and applications to be as accurate as possible, they need unstructured data as well as structured data. And that data has to be high quality. Otherwise, even properly trained models and applications will deliver low-quality outputs.

Essentially, data volume and data quality are critical for accurate AI.

"Unstructured data is now more crucial because organizations are increasingly using it for generative AI applications, requiring high volumes of varied data to train and fine-tune models effectively," Catanzano said. "Quality is emphasized now because poor-quality data can significantly impact the performance and reliability of AI models, particularly in generative AI."

Anomalo's new AI-powered monitoring of unstructured text, therefore, will address a real need once generally available, he continued.

The feature aims to enhance performance and privacy and reduce security risks by using AI to automatically evaluate text documents for characteristics such as inconsistencies, errors, duplication, sentiment, and personally identifiable information and other sensitive data. Following monitoring, users can evaluate the quality of documents and identify any issues that need to be addressed before their text is used to inform models, applications and other data products.
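
To make the kinds of checks described above concrete, the following is a minimal rule-based sketch in Python. It is not Anomalo's implementation, which the vendor describes as AI-powered; the function names, report fields and regular expressions are all hypothetical and chosen only for illustration. It flags empty documents, exact duplicates after light normalization, and two simple forms of personally identifiable information.

```python
import hashlib
import re

# Hypothetical patterns for illustration only; real PII detection
# needs far broader coverage than an email and a phone pattern.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def check_documents(docs):
    """Return one report per document with emptiness, duplication and PII flags."""
    seen = {}  # digest of normalized text -> index of the first document with it
    reports = []
    for index, text in enumerate(docs):
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        duplicate_of = seen.get(digest)  # None if this text has not been seen yet
        if duplicate_of is None:
            seen[digest] = index
        reports.append({
            "index": index,
            "is_empty": norm == "",
            "duplicate_of": duplicate_of,
            "has_email": bool(EMAIL_RE.search(text)),
            "has_phone": bool(PHONE_RE.search(text)),
        })
    return reports
```

Calling `check_documents` on a list of strings returns a list of dictionaries that a reviewer could filter before the text is used to inform a model. Checks such as sentiment or inconsistency, which the article also mentions, are not covered by this sketch and would likely need a language model rather than fixed rules.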

"[The new capability] ensures the data used in AI models is reliable and safe, thus improving model performance and compliance," Catanzano said.

The six dimensions of data quality include accuracy, completeness, consistency, validity, integrity and uniqueness.

Andy Thurai, an analyst at Constellation Research, similarly said Anomalo's unstructured data monitoring capabilities target a real need as organizations develop generative AI models and applications.

Data drift -- a change in the properties of the data used to train models -- is a significant issue, especially with unstructured text, he noted. Tools that address data drift and other data quality issues are vital to maintaining the accuracy and quality of the models themselves.
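
One simple way to picture data drift in text is to compare the word distribution of a new batch of documents against a baseline. The Python sketch below does that with the Jensen-Shannon divergence. It is an illustrative assumption, not a description of how Anomalo or any other vendor measures drift, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_distribution(docs):
    """Map each lowercase word to its share of all words in the documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two non-empty word distributions.

    The score is 0 when the distributions are identical and 1 when they
    share no words at all.
    """
    score = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = (pw + qw) / 2  # midpoint distribution for this word
        if pw:
            score += 0.5 * pw * math.log2(pw / mw)
        if qw:
            score += 0.5 * qw * math.log2(qw / mw)
    return score
```

A monitoring job could compute this score for each new batch and raise an alert when it crosses a threshold tuned on historical data. Word counts are a crude proxy, and embedding-based comparisons would catch shifts in meaning that this sketch misses.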

"Solutions such as this can help curate clean collections of documents by discovering and eliminating content that doesn't abide by [a customer's] corporate policy and governance," Thurai said. "This will help avoid model skew during training and fine-tuning, and also can potentially avoid issues in the inference phase."

In addition, by using AI to automatically monitor data quality, Anomalo's capabilities save the time and expense it would take for data engineers to manually observe all the data required to train models and applications, he added.

Anomalo is not the first vendor to address unstructured data monitoring. For example, Acceldata acquired Bewgle in September to add unstructured data observability, and Monte Carlo in November developed an integration with vector database specialist Pinecone that enables joint customers to monitor vectorized unstructured data.

Meanwhile, more broad-based data management vendors such as Collibra, Informatica and Qlik provide unstructured data quality monitoring capabilities.

The impetus for Anomalo's new capabilities came from customer interactions, according to Stanley.

Some of the vendor's users are among the many enterprises now developing and deploying AI applications that require large volumes of unstructured data for accuracy. And given the need for that unstructured data to be trustworthy -- for it to be of high enough quality so that models and applications can move beyond the pilot stage and into production -- those users requested that Anomalo develop data quality monitoring capabilities for more than just structured data.

Future plans

While Anomalo's initial unstructured data observability capabilities focus on text analysis, the vendor plans to develop capabilities that monitor additional unstructured data types, according to Stanley.

Audio files can be transcribed using AI capabilities, essentially converting them to text. That will be the next extension of Anomalo's unstructured data observability capabilities.

"Audio support will be enabled via AI transcription, the text results of which can be fed directly into our existing monitoring capabilities," Stanley said. "We don't have plans to extend this solution to images, video or other proprietary formats, but we will revisit based on customer feedback."

Thurai, meanwhile, said Anomalo would be wise to expand its offerings to include more than just data observability. For example, data lineage represents an opportunity to add new tools that could help customers develop trusted AI models and applications.

"I would love to see them expand to include more capabilities," Thurai said. "While data quality is a major concern for enterprises, data freshness, data lineage tracking, data downtime and even data transparency are a big deal for AI training data in big volumes."

Catanzano similarly noted that, in addition to expanding its unstructured data quality monitoring beyond text, Anomalo has other ways to grow its platform. For example, real-time data quality monitoring would be beneficial to enterprises, given that real-time decisions are often required amid quickly changing conditions such as supply chain disruptions and economic swings.

"It would be beneficial for Anomalo to continue expanding its capabilities by integrating more advanced analytics, real-time monitoring and automated remediation features to further enhance data quality management," Catanzano said.

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
